The Anjials of Applied Statistics 
2011, Vol. 5, No. 4, 2266-2277 
DOI: 10.1214/10-AOAS408 

@ Institute of Mathematical Statistics, 2011 



THE DUALITY DIAGRAM IN DATA ANALYSIS: EXAMPLES OF 
MODERN APPLICATIONS 

By Omar De la Cruz^ and Susan Holmes^ 

Stanford University 
Today's data-heavy research environment requires the integration 
of different sources of information into structured data sets that can 
not be analyzed as simple matrices. We introduce an old technique, 
known in the European data analyses circles as the Duality Diagram 
Approach, put to new uses through the use of a variety of metrics 
and ways of combining different diagrams together. This issue of the 
Annals of Applied Statistics contains contemporary examples of how 
this approach provides solutions to hard problems in data integration. 
We present here the genesis of the technique and how it can be seen 
as a precursor of the modern kernel based approaches. 

1. Introduction. Multivariate statistical methods have been used for ma- 
ny decades to deal with situations in which two or more variables are mea- 
sured or recorded for each unit. 

A classical example of this situation is Guerry's data set, in which several 
variables meant to capture "moral qualities" (e.g., literacy, crime rate, sui- 
cide rate) were tabulated for each of the departments in which France was 
divided at the time (1833). This data set suggests that one can be interested 
in how the variables change as one moves around in France, or one can be 
interested in how the departments compare to each other based on the mea- 
sured characteristics. It is of special interest how these two approaches can 
be combined; this is considered in detail in Dray and Jombart (2011). 

Besides having a combination of two essentially multidimensional sources 
of information, like geographic location plus recorded data, another layer of 
complexity is added when one more variable like time is added, leading to 
what essentially are two or more data cubes. A typical example of this is 
ecological data, where species abundances are measured at different, speci- 
fied locations, over the course of time. The different approaches used in this 
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setting are reviewed in Thioulouse (2011), using the duality diagram setting 
as a unifying framework. 

Such an approach is not limited to animal or plant species spread over 
a geographic area; the advent of metagenomics has made it possible to study 
the abundances of bacterial species in locations like ocean or pond water, 
or even the human gut. In this case it becomes important to incorporate 
information not only about location in space, but also in the phylogenetic 
landscape, by using established or inferred phylogenetic trees for the bacte- 
rial species detected. This problem is addressed, using the duality diagram 
formalism, in Purdom (2011). 

As an outgrowth of methods favored by French statisticians, Cailliez and 
Pages (1976) proposed a unifying framework capable of including many 
methods reinvented and used by different groups in different countries as 
special cases. This framework is based on the analysis of certain linear op- 
erators between inner-product spaces which can be naturally associated to 
a data matrix, in the same way Kernel matrices are used today in machine 
learning Scholkopf, Smola and Muller (1998). This is explained in detail in 
works like Escoufier (2006) and Holmes (2006). In this article we present 
some motivations behind the choices made for this approach in the accom- 
panying papers. 

The notion of duality is everywhere in Mathematics, appearing under 
different guises in the most diverse fields; and it is often remarkably useful. 
The idea of duality was introduced in the analysis of multivariate data by 
the French school of data analysts as a way to unify a suite of methods that 
turned out to be exactly or almost exactly equivalent to methods known by 
a different name, and the duality-diagram formalism provides a simple way 
to put all these methods in the same context. 

Since this approach is the basis of the special articles presented together 
here [Dray and Jombart (2011), Purdom (2011), Thioulouse (2011)], this 
short introduction aims to establish the basic facts and notation. The ab- 
stract approach in the duality diagram setup is often intimidating and it 
possibly turns away some interested readers; we hope we can show here 
that these notions are actually natural, and that the overhead due in under- 
standing the notation pays off handsomely in the breadth and complexity 
of applications. 

2. The data matrix as an operator between inner-product spaces. Today 
the distinction between the space of rows of the matrix as a sample from 
a population and the space of columns as the fixed variables on which the 
observations were measured has been softened and we often hear the term 
'transposable' data. The definitions presented here explain this row-column 
duality. 

By dispensing of the traditional probabilistic sample-population interpre- 
tation, European data analysts in the 1970s [Benzecri (1973), Cailliez and 
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Pages (1976), Gifi (1990)] can be seen in hindsight as precursors of the cur- 
rent Machine learning schools. It is interesting to remember that all these 
schools had precursors who spent time at the AT&T laboratories in New 
Jersey at a time when John Tukey was active there. 

Consider an n x p matrix X containing data for variables Vi, . . . ,Vp col- 
lected from n individuals or units. This matrix defines an operator Lx : IR^ — )• 
R" by the rule v i— )• Bv. What interpretation can we give to such a map? The 
vector V can be considered to contain the coefficients for linearly combining 
the variables Vi,. . . ,Vp into a new, synthetic, variable. In that sense, it be- 
comes apparent that actually we should consider this a map from MP*, the 
dual space of W, into M". The map Lx provides a way to fill in the n values 

for the new synthetic variable V = viVi -\ + VpVp, which could have been 

defined even before collecting the data. From now on we will abuse notation 
and identify the operator Lx and the matrix X (and will do the same with 
other similarly defined operators and matrices). We have then the following 
portion of the diagram: 

MP* 



2.1. Adjoint operators as a useful formalism. Recall that the adjoint of 
a linear transformation T : Vi — t- V2 between inner product spaces is defined 
as the mapping T* : V2 — )• Vi that satisfies 

{Tu,z)2 = {u,T*z)i yueVi,yzeV2 

(for simplicity, we will only consider spaces with scalars in M in this article; 
this is enough for most data analyses). This can be seen as just a clever way 
of extending the notion of matrix transpose to a more general setting, but 
it is actually a powerful formalism, especially when dealing with multiple 
inner products on the same spaces (notice that T* depends not only on T 
but also on (•, •)! and (•, •)2). 

It is convenient sometimes to think of T* as a map from V2 to ^li this 
matches the corresponding situation when it is generalized to Banach spaces. 
In our setting this distinction might be considered moot, since all spaces 
considered are naturally isomorphic to their duals, but we will continue using 
the star notation; the diagrams, and all the matrix operations obtained from 
them, work equally well if the stars are dropped from the spaces. Then we 
have the following: 

RP* > 
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Since we are considering the standard inner products on W and M" (and 
their dual spaces), X* corresponds just to the transpose X-^ of X. Thus, 
X*X = X-^X, XX* = XX"^, these two symmetric matrices have the same 
eigenvalues (except possibly for zeros to account for the difference between p 
and n), and the two sets of eigenvectors can be used to form the singular 
value decomposition (SVD) of X [Golub and Van (1996)]. 

2.2. The general duality diagram. Consider now the situation where the 
inner products (i.e., the geometries) on and M" are not standard. That is, 
assume that there are symmetric, positive definite matrices Qpxp and D„xn 
such that the inner products 

(u,v)q:=u^Qv Vu,vGRP 

and 

(w,z)d:=w^Dz Vw,zGM" 

somehow make more sense for a particular data analysis than the stan- 
dard inner products. A typical example is when D is a diagonal matrix 
of (positive) weights, one for each individual, down-weighting individuals 
that are known to have been measured with a larger error; another example 
is when Q is the diagonal matrix containing the reciprocals of the sample 
variances for the columns of X, which corresponds to standardizing the vari- 
ables (assuming they are already centered); a related example is when Q is 
the inverse of the sample variance-covariance matrix obtained from X, in 
which case the new geometry corresponds to the Mahalanobis distance. Of- 
ten, we want to consider the case in which Q = LL^ and we are interested 
in a set {Ui, . . . , Up} of transformed variables obtained from {Vi, . . . , Vp} by 
multiplication by L, leading to a transformed data matrix Y = XL. 

Different multivariate procedures can be obtained by appropriately choos- 
ing Q and D; see Section 3 for some examples. 

Instead of X and its adjoint, consider now the transformation XQ — )■ 
M". That is, a vector v of coefficients is first transformed into L-^v, which 
is in the scale of the transformed data matrix Y = XL, and then used to 
create a linear combination of the variables Ui, . . . , Up. Then, for all u S M^, 

(XQu, z)d = (XQu)^Dz = u^QX^Dz = (u, X^Dz)q, 

so (XQ)* = X^D. Then, (XQ)*XQ = X^DXQ and XQ(XQ)* = XQX^D 
are self-adjoint operators on {W,{-,-)q) and (M", (•, •)d), respectively, but 
they are not necessarily symmetric matrices. Nevertheless, they have real 
eigenvalues (which match, except for zeros to account for the difference 
between p and n), because they are similar to symmetric matrices by way 
of positive definite matrices; for example, 

Qi/2(x^DXQ)Q-i/2 ^ q1/2x^dXQ1/^ 

where Q" is obtained by replacing each eigenvalue A of Q with A". 
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The eigenvectors of X DXQ and XQX D are also real; however, they 
need not be orthogonal. Nevertheless, the eigenvectors of X-^DXQ can be 
taken to be orthogonal with respect to (•,-)q, and those of XQX^^D to be 
orthogonal with respect to (•,-)d- (This can be interpreted as leading to 
a generalized version of the SVD.) 

In diagram form, we have the following: 




This way, the triplet of matrices (X,Q,D) defines a multivariate data 
analysis setup, in which the main strategy is the computation of the eigen- 
decompositions of the matrices X^DXQ and XQX"^D. It is also customary 
to denote V = X-^DX and W = XQX"^, so that the two operators of inter- 
est become VQ and WD. 

The eigendecomposition can be computed for the smaller of the two 
matrices (which is usually much smaller than the other), and, if needed, 
the eigenvectors for the other one can be easily obtained: for example, if 
X"^DXQv = Av, then w := XQv satisfies 

XQX^Dw = XQ(X^DXQv = XQ(Av)) = Aw. 

Furthermore, orthogonality is also preserved among eigenvectors: if vi,V2 
are Q-orthogonal eigenvectors for X^DXQ, then 

(XQvi,XQv2)d = vfQX^DXQv2 = vfQ(Av2) = A(vi,V2)q =0. 

Thus, whole eigendecompositions are easily transferred. 

2.3. Connections with kernel methods. The operator XQX^D can be 
seen as a precursor of the more general kernel matrices used in today's 
kernel PCA type methods [Scholkopf, Smola and MuUer (1998)]. 

In the kernel approach to data analysis one assumes that the data is 
provided as a n x n matrix K containing proximity scores for each pair of 
individuals; these scores might have been computed from measured variables 
(as in the case of the matrix XQX"^D, the linear kernel; nonlinear functions 
of the variables offer a great variety of other possibilities), or by directly 
comparing the individuals; see Scholkopf, Tsuda and Vert (2004) for a review 
of the theory and examples of applications in computational biology. 
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Kernel matrices are similarity matrices, that is, the proximity score for 
two individuals is high when they are similar and low (even negative) when 
they are dissimilar. (Distance matrices, on the other hand, have higher val- 
ues for dissimilar pairs of individuals.) What is the meaning of K as an 
operator on M"? A useful interpretation is to consider it as a smoothing 
operator acting on real-valued functions / defined on the set of individuals: 
the value of K/ at the ith individual is a weighted sum of all the values 
of /, with higher weights for those individuals more similar to the ith; in 
other words, / is averaged locally (up to a multiplicative constant). This 
offers one explanation of why the eigendecomposition of K is useful, since 
repeated application of K to / G M" (which should produce a very smooth 
function) converges toward an eigenvector of the leading eigenvalue, and 
"very smooth" functions of the data points can be used as coordinates. 

Purdom (2011) explores the similarities between the duality diagram and 
kernel approaches in Appendix B, for the case of kernel Canonical Corre- 
spondence Analysis. 

3. Examples of well-known methods as particular cases of the diagram. 

Here we briefly describe how some well-known multivariate methods can 
be expressed as particular cases of the duality diagram, by appropriately 
choosing Q and D. We will assume that X is centered by columns (i.e., the 
mean has been subtracted for each variable). 

3.1. Principal components analysis (PCA). PCA seeks to find linear 
combinations of the variables that explain most of the variability in the 
data; see Mardia, Kent and Bibby (1979), for example, for more details. 

Take Q = Ip, and D = ;^In- This corresponds to PCA in the original scales; 
it is equivalent to a straightforward SVD on X (except for the factor 1/n). 

If one standardizes the variables, as it is often appropriate to eliminate 
unit scale effects, then Q is taken to be the diagonal matrix containing 
the reciprocals of the sample variances of the columns of X (so L contains 
the reciprocals of the standard deviations) . While the ith eigenvector Vj of 
the (D-weighted) sample covariance matrix Y^DY provides the loadings of 
the variables Ui, . . . ,Up for the ith principal component (so that the actual 
components have to be obtained by p, = Yvj/-v/Ai), Pi can be obtained 
directly as an eigenvector for XQX^D: indeed, 

XQX'^Dpi = XQX'^DYvi/\/A^ = XLL^X^DYvi/y^ 

= YY^BY^rJ^/Xi = YAiV,/0^ = A.p^. 

Computing the principal components pi does not require the explicit de- 
composition of Q as LL"^. 
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3.2. Correspondence analysis (CA). A total of m observations are clas- 
sified according to two categorical variables, one with n categories or levels, 
and the other with p, producing a n x p matrix N of counts for each com- 
bination of levels (a contingency table). One wants to study how the counts 
differ from the expected counts under the assumption of independence be- 
tween the two variables. To cast CA as a duality diagram, we first define the 
frequency matrix F = N/m and the marginal frequency vectors c = F^l„xi 
and r = Flpxi; then, the expected counts (conditionally on the marginals) 
are given by nrc^ . Using the matrices = diag(r) and Dc = diag(c), we 
can standardize F by 

X := B;\F - rc^)D,"i = U;'FU~' - 

The matrix X seems like a reasonable choice to study by eigendecomposition. 
However, all rows and columns have been reduced to the same importance, 
while, heuristically, categories with larger marginal counts should provide 
more accurate information on the distribution of the other variable, and thus 
should be given greater weight. This can be achieved by defining the triplet 
(X, Dc,Dr). Notice that actually X is centered by rows and by columns 
with respect to the inner products given by and Dc. This approach 
matches the traditional definition of CA. Purdom (2011) shows how the 
information about relationships between the rows of the contingency table 
can be incorporated into the duality diagram in the special case where there 
are binary trees that connects the rows of the abundance matrix. 

3.3. Variance, inertia, co-inertia. The study of variability of one con- 
tinuous variable is done through the use of the variance; this notion is gen- 
eralized in several different directions to accommodate the complexities of 
dealing with multiple tables, graphs, etc., through the concept of inertia. As 
in physics, we define inertia as a weighted sum of squared distances of the 
weighted points. For each of the diagrams studied above, the inertia desig- 
nates the trace of the operator WD, and we have Inertiatotai = tr(VFL)) = 
tr{VQ). As pointed out in Purdom (2011), in the case of CA, the inertia 
is proportional to the statistic, whereas in ordinary PCA it is just the 
total variance of all the variables. In discriminant analysis, the inertia is de- 
composed into between-groups and within-group components; these are also 
used in the BCA analysis [Thioulouse (2011), Dray and Jombart (2011)]. 

The weighted distances between columns have another interpretation in 
ecology and Purdom (2011) shows how they can be associated to different 
measures of diversity. 

The decomposition of total inertia can be seen as a generalization to 
MANOVA which is the special case of a variance decomposition. Purdom 
(2011) uses this effectively to show how to decompose the total diversity 



8 



O. DE LA CRUZ AND S. HOLMES 



across all locations into the average diversity of individual locations and 
plus the average of pairwise dissimilarities of locations. 

Dray and Jombart (2011) use similar decompositions to show what part of 
the inertia can be assigned to spatially local variation in their BCA approach 
to multivariate spatial data. They also show how the graphical relationships 
between rows can be encoded in a special metric D built from the weighted 
connectivity matrix. (In their paper, they call these weights W.) 

4. One more level of complexity: Comparing diagrams. Interesting re- 
sults can be obtained by combining two or more triplets. The usual assump- 
tion is that two (or more) sets of variables are measured on the same set 
of n individuals; thus, the matrix D is assumed to be common, but each set 
of variables has its own version of Q, of the appropriate size. 

For example, one of the triplets might contain data from variables mea- 
sured on each of the individuals, while the other might encode known rela- 
tionships between the individuals. 

4.1. The RV coefficient. A key element in the comparison of the op- 
erators arising from two duality diagrams is the RV coefficient. It can be 
considered as a generalization of the squared correlation coefficient by using 
the Proebenius matrix product. 

Given two symmetric matrices A, B of the same size, we define COVV(A, 
B) =tr(AB), and 

tr(AB) 



RV(A,B) 



^tr(AA)tr(BB)' 



whenever A,B 7^ 0. Many nice properties of these definitions arise from the 
fact that tr(AB) defines an inner product on the vector space of symmet- 
ric matrices of a given size. This can be adapted to the general setting of 
multiple duality diagrams: having D fixed, call 5(D) the vector space of D- 
symmetric matrices, that is, matrices satisfying DA = A"^D (equivalently, A 
is self-adjoint with respect to (•, •)d)- Then tr(AB) defines an inner product 
on 5(D). 

When comparing two duality diagrams (Xi, Qi, D), (X2, Q2, D), then 
numbers pi,P2 of variables might be different, yielding matrices ViQi, V2Q2 
of different size; however, we will be comparing the matrices (operators) WiD 
and W2D, which are of the same size and D-symmetric. We define the RV 
coefficient of the two diagrams as RV(WiD, W2D). 

Some immediate properties of the RV coefficient are as follows: its val- 
ues are always in [0,1]; it equals 1 only when Wi = 7^ 0, for some 
nonzero scalar a; and it equals only when X^[^DX2 = (provided Qi,Q2 
are nonsingular). The proofs are not too hard; more details can be found in 
Escoufier (2006). 



INTRODUCTION TO THE DUALITY DIAGRAM 



9 



The RV coefficient between diagrams (or triplets) can be used for justi- 
fying the use of eigenvalues and eigenvectors in this setting. For example, 
performing PCA based on (X, Q,D) and selecting the q top components is 
equivalent to finding a matrix Z^xg such that the RV coefficient between 
(X,Q,D) and (Z,Ig,D) is maximized. 

4.1.1. PCA with respect to instrumental variables. When one data set Y 
has the special status of a response that we would like to predict or explain 
from the other data set X of explanatory variables, we can generalize ordi- 
nary regression to a multivariate response through the same diagram frame- 
work. This is called PCA with respect to instrumental variables^ abbreviated 
PCA-IV (also known as redundancy analysis, RDA), first described by Rao 
(1964). In terms of the comparison of duality diagrams and RV coefficients, 
this problem can be rephrased as that of finding the metric M to associate 
to X so that (X, M,D) is as close as possible to (Y,Q,D) in the RV sense. 
That is, we want to maximize i?F(XMX^D, YQY'^D). We abbreviate the 
cross-products by writing 



X"^DX — Sxxi Y"^DY — Syy, X"^DY — Sxy 



and 



■f^ ~ ^xx ^xy^^yx^xx • 

Then for any R 

1 1 YQ Y^D-XMX^D f = 1 1 YQ Y^D-XRX^D XRX'^D-XMX^D f. 

The first term on the right-hand side does not depend on M, and the second 
term will be zero for the choice M = R. 

If we add the extra constraint that we only allow ourselves a rank q ap- 
proximation, with q < min{rank(X), rank(Y)}, the optimal choice of a pos- 
itive definite matrix M is to take M = RBB"^R where the columns of B 
are the eigenvectors of X-^DXR with 

1 „ 1 

X^DXR/3, = XkP„ 
such that ^ /3^R/3fc = A^, /c = 1, . . . , g, 

Ai > A2 > ••• > Ag. 

The PCA with regards to instrumental variables of rank q is equivalent to 
the PCA of rank q of the triple (X, R, D) where 

^ ~ ^xx ^xyCl^yx^xx ■ 



B=(^^i,...,-=^^ 
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4.2. Comparing more than two diagrams. Consider k diagrams (Xi,Qi, 
D), . . . , (Xk, Qfc, D). This could correspond, for example, to k different stud- 
ies on the same subjects, using different variables, or the same variables 
measured on the same units at different points in time (time course study); 
a review of this problem in the setting of community ecology is found in 
Thioulouse (2011). It is often important to summarize the relationships be- 
tween the diagrams in a compact and intelligible way. The RV coefficients in 
fact allow us to consider this as performing a PCA of the PCAs. We compute 
the multivariate correlation coefficients between tables and use those as the 
matrix to be diagonalized, similarly to what happens in ordinary PCA. 

The values of the pairwise computations of the COW and RV coefficients 
are arranged into k x k symmetric matrices C and R, respectively, and 
the eigendecomposition of these matrices can lead to useful low-dimensional 
representations, just as in the case of PCA using the covariance or correlation 
matrices, respectively. In this case, a 2- or 3-dimensional plot can be created 
in which each point represents one of the studies (diagrams). 

Furthermore, since C and R have nonnegative entries, the eigenvector ui 
corresponding to the largest eigenvalue can be taken to have only nonnega- 
tive entries, adding up to 1. Then, defining W = J2i=i ^iiWj, the operator 
WD can be taken as a compromise or summary of all the diagrams, and 
one can study how far, in the RV sense, different studies are from the com- 
promise. 

These steps are part of the so-called STATIS procedure [Escoufier (1980)]. 
One can think of these data sets as a data cube, with three indices; then 
a similar procedure can be used to compare two or more such cubes. 

5. Conclusions. The duality diagram is a useful formalism that allows 
one to easily compare many classical multivariate methods, revealing what 
they have in common and where they differ. But, furthermore, it has become 
a valuable tool for dealing with two problems that have become very com- 
mon: (1) combining and amalgamating data which, although collected from 
different sources and using different methods, shed light on different aspects 
of the same phenomenon; and (2) taking advantage of complex, nontradi- 
tional data types, like tree and network information. These two problems are 
closely related, as the data to be amalgamated are often of complex type. 

The overhead in effort to understand the abstract definitions in the duality 
diagram approach to data analysis is amply offset by the clearer picture that 
is gained and by the wealth of applications that become available. In this 
article we have tried to reduce that overhead by laying out arguments that 
show that those definitions are actually quite natural. The three articles 
[Dray and Jombart (2011), Thioulouse (2011), Purdom (2011)] in this group 
are excellent examples of the power of this approach, but are only a small 
sample from a large and growing body of work. 
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Recently, Shinkareva et al. (2008) have used the RV coefficient and STATIS 
approaches to explore fMRI brain activation in conjunction with stimula- 
tions such as images of tools. 

A series of papers [Culhane et al. (2002), Culhane, Perriere and Higgins 
(2003), Fagan, Culhane and Higgins (2007)] have applied BCA and Co- 
Inertia analyses to the problem of integrating multiple sources of data from 
heterogeneous gene expression and proteomic studies. 

Baty et al. (2006) used the PCAIV method to identify special genes in 
microarray data and Baty et al. (2008) used bootstrap and permutation type 
tests for evaluating the stability of the gene identifications produced. 

Most of the methods presented in these papers have been coded into func- 
tions for the statistical computation environment R [Ihaka and Gentleman 
(1996)], many available in the library ade4, for which exemplary presen- 
tations have been published [see Chessel, Dufour and Thioulouse (2004), 
Dray, Dufour and Chessel (2007), Dray and Dufour (2007)]. In the case of 
Thioulouse (2011), you can even run in an interactive way through all the 
commands generating each and every plot through the reproducible website 
at http : //pbil . univ-lyonl . f r/SAOASOPET/. 
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